EEG-GCNN: Dataset, Task, Models, and Neurological Disease Detection Pipeline#1059

Closed
jburhan wants to merge 1114 commits into sunlabuiuc:master from jburhan:master

@jburhan jburhan commented Apr 21, 2026

Contributors

  • Jimmy Burhan — jburhan2 (Dataset + Task)
  • Robert Coffey — rc37 (Models + Training)

Type of Contribution

Option 4: Full Pipeline — Dataset + Task + Model (60 pts)

Paper

Wagh & Varatharajah (2020) — EEG-GCNN: Augmenting Electroencephalogram-based Neurological Disease Diagnosis using a Domain-guided Graph Convolutional Neural Network, ML4H @ NeurIPS 2020

Summary

End-to-end EEG-GCNN pipeline for neurological disease detection from EEG recordings, replicating and extending Wagh & Varatharajah (2020).

Dataset & Task (Jimmy Burhan — jburhan2)

  • EEGGCNNRawDataset — raw EEG loader for TUAB (1,385 clinically-normal patients, label 0) + LEMON (208 healthy volunteers, label 1): resamples to 250 Hz, 1 Hz high-pass + 60/50 Hz notch filter, 8 bipolar channels, 10 s non-overlapping windows, 6-band Welch PSD features (delta/theta/alpha/beta/low_gamma/high_gamma), writes 5 FigShare-format arrays via precompute_features()
  • EEGGCNNDataset — FigShare pre-computed dataset (1,593 subjects, 225,334 windows); builds graph adjacency via tunable blend of geodesic distance and spectral coherence: edge_weight = α·geodesic + (1−α)·coherence
  • EEGGCNNDiseaseDetection — per-sample streaming task with configurable adjacency type (spatial / functional / combined) and band subset; available as an alternative on-the-fly entry point
  • EEGGCNNClassification — window-level classification task for FigShare data; supports excluded_bands for spectral ablation
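
The windowing step described above (250 Hz resampling, 10 s non-overlapping windows) works out to 2,500 samples per window. The following is a minimal standard-library sketch of that arithmetic, not the code in `EEGGCNNRawDataset`. The function name `split_into_windows` is hypothetical, and dropping a trailing partial window is an assumption.

```python
from typing import List, Sequence

SAMPLING_RATE_HZ = 250  # resampling rate stated in the PR description
WINDOW_SECONDS = 10     # window length stated in the PR description


def split_into_windows(
    signal: Sequence[float],
    sampling_rate_hz: int = SAMPLING_RATE_HZ,
    window_seconds: int = WINDOW_SECONDS,
) -> List[List[float]]:
    """Split a 1-D signal into non-overlapping windows.

    Assumption: a trailing partial window is dropped.
    """
    window_len = sampling_rate_hz * window_seconds
    n_windows = len(signal) // window_len
    return [
        list(signal[i * window_len:(i + 1) * window_len])
        for i in range(n_windows)
    ]


# 25 s of a dummy single-channel signal: two full windows, last 5 s dropped.
dummy = [0.0] * (25 * SAMPLING_RATE_HZ)
windows = split_into_windows(dummy)
print(len(windows), len(windows[0]))  # prints: 2 2500
```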

Models & Training (Robert Coffey — rc37)

  • EEGGraphConvNet — two-layer GCN classifier (reproduces paper's Shallow EEG-GCNN, float32)
  • EEGGATConvNet — two-layer graph attention network classifier (float64)
  • Training pipelines, holdout evaluation scripts, band frequency ablation script
  • 83 unit and integration tests across 4 test files — synthetic fixtures for fast unit coverage + real EDF/BrainVision integration tests via TestRawDatasetPrecompute

Ablation Studies (Original Extensions)

Ablation 1 — Adjacency Type (α sweep, GCN)

Varied α ∈ {0.0, 0.25, 0.5, 0.75, 1.0} on the full FigShare dataset (477 test patients, 70/30 split, 10-fold CV). Paper only uses α=0.5 — this is our novel contribution:

| α | auroc_patient | bal_acc | f1 |
|---|---|---|---|
| 0.0 (coherence only) | 0.8841 ± 0.0132 | 0.8087 ± 0.0130 | 0.8222 ± 0.0176 |
| 0.25 | 0.8943 ± 0.0072 | 0.8069 ± 0.0122 | 0.8385 ± 0.0162 |
| 0.50 (paper default) | 0.8984 ± 0.0102 | 0.8276 ± 0.0122 | 0.8399 ± 0.0281 |
| 0.75 | 0.8991 ± 0.0066 | 0.8209 ± 0.0051 | 0.8491 ± 0.0323 |
| 1.0 (geodesic only) | 0.7891 ± 0.0471 | 0.7406 ± 0.0349 | 0.8211 ± 0.0599 |

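α=1.0 (geodesic only, i.e. spatial-only) degrades performance sharply: auroc_patient falls to 0.7891 and balanced accuracy to 0.7406, so the coherence (functional connectivity) term is essential. α=0.75 gives the best auroc_patient and F1; α=0.50 gives the best balanced accuracy.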
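
The α blend swept in this ablation follows the formula given in the summary, edge_weight = α·geodesic + (1−α)·coherence. Here is a minimal sketch of that element-wise blend on a toy 3-node graph. The function name `blend_adjacency` and the toy matrices are hypothetical, and both inputs are assumed to be already scaled to comparable edge weights. How `EEGGCNNDataset` normalizes them is not shown in this PR.

```python
from typing import List

Matrix = List[List[float]]


def blend_adjacency(geodesic: Matrix, coherence: Matrix, alpha: float) -> Matrix:
    """Return alpha * geodesic + (1 - alpha) * coherence, element-wise."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(geodesic) != len(coherence):
        raise ValueError("matrices must have the same number of rows")
    return [
        [alpha * g + (1.0 - alpha) * c for g, c in zip(g_row, c_row)]
        for g_row, c_row in zip(geodesic, coherence)
    ]


# Toy symmetric matrices (made-up values, for illustration only).
geodesic = [[1.0, 0.8, 0.2], [0.8, 1.0, 0.5], [0.2, 0.5, 1.0]]
coherence = [[1.0, 0.4, 0.6], [0.4, 1.0, 0.1], [0.6, 0.1, 1.0]]

# Same alpha grid as the ablation: 0.0 is coherence only, 1.0 is geodesic only.
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    edge_weights = blend_adjacency(geodesic, coherence, alpha)
    print(alpha, edge_weights[0][1])
```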

Ablation 2 — Spectral Frequency Analysis (GCN, inference-only)

Leave-one-out (LOO) and keep-one-in (KOI) across all 6 bands using trained GCN checkpoints — no retraining required. Low gamma is the single most important band (removing it drops AUROC from 0.898 → 0.594; keeping only it recovers 0.723). All bands make a meaningful contribution — none is redundant. Numerical results in project slides.
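
The LOO and KOI conditions can be enumerated as lists of bands to exclude, which is the shape an `excluded_bands` argument would take. The sketch below assumes that the 13 conditions listed in the file guide are one baseline plus six LOO and six KOI runs. The helper name `build_ablation_conditions` and the condition labels are hypothetical.

```python
from typing import Dict, List

BANDS = ["delta", "theta", "alpha", "beta", "low_gamma", "high_gamma"]


def build_ablation_conditions(bands: List[str]) -> Dict[str, List[str]]:
    """Map each condition name to the bands excluded in that condition."""
    conditions: Dict[str, List[str]] = {"baseline": []}
    for band in bands:
        # Leave-one-out: exclude only this band.
        conditions[f"loo_{band}"] = [band]
        # Keep-one-in: exclude every band except this one.
        conditions[f"koi_{band}"] = [b for b in bands if b != band]
    return conditions


conditions = build_ablation_conditions(BANDS)
for name, excluded in conditions.items():
    print(f"{name}: excluded_bands={excluded}")
```

Because each condition only changes which band features are masked at inference time, a single set of trained checkpoints can be reused for all of them.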

Best Results (GAT, tuned learning rate)

| auroc_patient | bal_acc | f1 | precision | recall |
|---|---|---|---|---|
| 0.9415 ± 0.0251 | 0.8854 ± 0.0322 | 0.9201 ± 0.0231 | 0.9855 ± 0.0089 | 0.8638 ± 0.0409 |

GAT with a tuned learning rate outperforms the best GCN configuration on every metric reported for both models (auroc_patient, bal_acc, f1).
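
Patient-level metrics require collapsing window-level predictions to one score per patient, and the holdout evaluation script is described as using a Youden's J threshold. The sketch below shows one way to do both. Mean aggregation over windows is an assumption, the function names are hypothetical, and the toy probabilities are made up. The positive class is whichever class is labeled 1.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def patient_scores(
    patient_ids: Sequence[str], window_probs: Sequence[float]
) -> Dict[str, float]:
    """Average window-level probabilities per patient (assumed aggregation)."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for pid, prob in zip(patient_ids, window_probs):
        grouped[pid].append(prob)
    return {pid: sum(probs) / len(probs) for pid, probs in grouped.items()}


def youden_threshold(
    scores: Sequence[float], labels: Sequence[int]
) -> Tuple[float, float]:
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    best_threshold, best_j = 0.0, -1.0
    for threshold in sorted(set(scores)):
        # A patient is predicted positive when score >= threshold.
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Toy data: four patients, two windows each.
ids = ["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4"]
probs = [1.0, 0.5, 0.25, 0.25, 0.5, 0.75, 0.0, 0.25]
true_label = {"p1": 1, "p2": 0, "p3": 1, "p4": 0}

per_patient = patient_scores(ids, probs)
ordered = sorted(per_patient)
threshold, j = youden_threshold(
    [per_patient[p] for p in ordered], [true_label[p] for p in ordered]
)
print(threshold, j)  # prints: 0.625 1.0
```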

File Guide

| File | Description |
|---|---|
| pyhealth/datasets/eeg_gcnn.py | EEGGCNNDataset: FigShare pre-computed loader |
| pyhealth/datasets/eeg_gcnn_raw.py | EEGGCNNRawDataset: raw TUAB + LEMON loader with precompute_features() |
| pyhealth/datasets/configs/eeg_gcnn.yaml | FigShare dataset config |
| pyhealth/datasets/configs/eeg_gcnn_raw.yaml | Raw dataset config |
| pyhealth/tasks/eeg_gcnn_classification.py | EEGGCNNClassification: window-level task for FigShare path |
| pyhealth/tasks/eeg_gcnn_disease_detection.py | EEGGCNNDiseaseDetection: streaming task for raw path |
| pyhealth/models/eeg_gcnn.py | EEGGraphConvNet (GCN) |
| pyhealth/models/eeg_gatcnn.py | EEGGATConvNet (GAT) |
| examples/eeg_gcnn/eeg_gcnn_classification_gcn_precompute.py | Precompute: raw TUAB + LEMON → 5 FigShare-format arrays (GCN path) |
| examples/eeg_gcnn/eeg_gcnn_classification_gcn_training.py | GCN training (10-fold CV) |
| examples/eeg_gcnn/eeg_gcnn_classification_gcn_evaluation.py | GCN holdout evaluation (Youden's J threshold) |
| examples/eeg_gcnn/eeg_gcnn_classification_gcn_band_ablation.py | Spectral frequency ablation (LOO + KOI, 13 conditions) |
| examples/eeg_gatcnn/eeg_gcnn_classification_gat_precompute.py | Precompute for GAT path (float64) |
| examples/eeg_gatcnn/eeg_gcnn_classification_gat_training.py | GAT training (tuned LR) |
| examples/eeg_gatcnn/eeg_gcnn_classification_gat_evaluation.py | GAT holdout evaluation |
| examples/eeg_gcnn/sample_precomputed_data/ | 5 FigShare-format sample arrays for quick-start testing |
| examples/eeg_gcnn/sample_raw_data/ | Sample EDF (TUAB) + BrainVision (LEMON) files used by integration tests |
| tests/core/test_eeg_gcnn_raw_dataset.py | 45 tests: EEGGCNNRawDataset + EEGGCNNDiseaseDetection (synthetic + real EDF/BrainVision integration) |
| tests/core/test_eeg_gcnn_dataset.py | 18 tests: EEGGCNNDataset + EEGGCNNClassification (synthetic fixtures) |
| tests/core/test_eeg_gcnn_model.py | 8 tests: EEGGraphConvNet (forward pass, shapes, gradients) |
| tests/core/test_eeg_gat_model.py | 12 tests: EEGGATConvNet (forward pass, dropout configs, gradients) |
| docs/api/eeg_gcnn_pipeline.rst | Full pipeline documentation |
| docs/api/datasets/pyhealth.datasets.EEGGCNNDataset.rst | API doc: FigShare dataset |
| docs/api/datasets/pyhealth.datasets.EEGGCNNRawDataset.rst | API doc: raw dataset |
| docs/api/models/pyhealth.models.EEGGraphConvNet.rst | API doc: GCN model |
| docs/api/models/pyhealth.models.EEGGATConvNet.rst | API doc: GAT model |
| docs/api/tasks/pyhealth.tasks.EEGGCNNClassification.rst | API doc: classification task |
| docs/api/tasks/pyhealth.tasks.eeg_gcnn_disease_detection.rst | API doc: disease detection task |

Test Plan

# Raw path — EEGGCNNRawDataset + EEGGCNNDiseaseDetection (45 tests)
pytest tests/core/test_eeg_gcnn_raw_dataset.py -v

# FigShare path — EEGGCNNDataset + EEGGCNNClassification (18 tests)
pytest tests/core/test_eeg_gcnn_dataset.py -v

# Models — EEGGraphConvNet + EEGGATConvNet (20 tests)
pytest tests/core/test_eeg_gcnn_model.py tests/core/test_eeg_gat_model.py -v

# All 83 tests, ~30 seconds, no real EEG data required for synthetic tests
pytest tests/core/test_eeg_gcnn_raw_dataset.py tests/core/test_eeg_gcnn_dataset.py tests/core/test_eeg_gcnn_model.py tests/core/test_eeg_gat_model.py -v

jhnwu3 and others added 30 commits December 2, 2025 18:41
* Add DeepNestedSequenceProcessor and DeepNestedFloatsProcessor for handling deeply nested sequences

* Add unit tests for DeepNestedSequenceProcessor and DeepNestedFloatsProcessor

* Refactor EmbeddingModel to support DeepNestedSequenceProcessor and DeepNestedFloatsProcessor, and support mask outputs

* Refactor AdaCare model to integrate new dataset and processor classes, update input handling, and forward propagation logic

* Remove output mask handling for passthrough tensors in EmbeddingModel forward method

* Add unit tests for AdaCare model including initialization, forward and backward passes, loss checks, and output shapes

* Update mortality prediction example script for MIMIC-III using adacare

* Remove property decorator from size function in StageNetTensorProcessor class
* init commit for 1 solution

* also change ex cache

* init commit

* commit new test case and fixes

* new update

* minor update to the number of workers used, turns out it does make a difference in processing speed

* update again

* update with comprehensive icd medcode vocabs

* fix typo in test cases and stagenet to match other processors

* already tested with full training on holdout sets, etc.
* more typo fixes

* split to make sure things are more clear on how to apply and a separate blog post on the highlights for ml4h
breaking rules, but need to just update this.
* Fix SequenceProcessor

* Fix up

* Fix incorrect test
* init commit for now, will merge later after memory fixes

* fixes to bugs
* Polars fix bug of OOM on large table join

* Fix type hint

* Add cache_dir

* Remove to_lower as this is a no-op

* Add caching behaviour

* Add test case

* Add StreamingParquetWriter

* write samples

* Add SampleBuilder

* Fix Mimic4

* fix incorrect dev mode

* change fit to take Iterator

* update test

* rename

* update fit to use Iterable

* Fix SampleBuilder

* Fix tsv test

* Fix base dataset test

* cache processed data

* save schema for SampleBuilder

* Fix sampledataset

* Fix multi-worker crashes

* update test

* Fix non-pickable

* Fix get_dataloader

* Fix embedding

* support split

* Fix test

* Fix collate_fn

* fix conflicting cache dir

* update test

* add create_sample_dataset to convert list of samples to SampleDataset

* test new dataset

* Update docs

* Fix test

* Fix test

* Fix test

* Fix test

* Fix test

* add InMemorySampleDataset

* support InMemorySampleDataset

* Fix test

* Fix tests

* Fix test

* Fix test

* support set_shuffle for InMemorySampleDataset

* Fix in memory dataset subset

* Fix test

* update sample dataset test

* Add deps

* Fix test

* commit for fixing model docs

* fix adacare docstrings

* fix for python 3.10 override typing incompatibility, but still struggling with problems with out of memory?

* organize for benchmarking scripts

* add deps

* add _scan_csv_tsv_gz

* convert load_table to dask

* convert load_data to dask

* convert global_event_df to dask

* Fix base dataset test

* Fix bug

* Fixup

* Fixup

* fixup

* main guard

* fix incorrect null value handling

* change back to ms to mimic old pyhealth behaviour

* add TODO

* main guard check

* fix nullable issue?

* revert change

* update API layers

* Fix cache test

* support remote url

* fix test

* fix incorrect type in nw

* less worker

* non-dev for memtest

* fix incorrect behaviour on notebook & make sure dask exception throw at dask

* update on installation details and recommended settings for use with PyHealth due to new change in backend

* additional clarifications here

---------

Co-authored-by: Yongda Fan <yongdaf2@illinois.edu>
* new alpha ver. with streaming changes

* minor bug fix towards timeseries processor where in python 3.12, naming of class variables and function names conflict when the same string
* fixes for timeseries processor, bugs in datetime parsing, and other details

* update to examples to run these things

* version update again

* doc updates to be more accurate, changed the error raising, and also added another bug fix to where dask leverages its temporary file usage to be in line with the cache_dir
* add num_workers to BaseDataset

* Fix test

* Fix MIMIC4

* use multi-process mode when not in notebook to speed up the process

* Fix incorrect null handling for timestamp

* add more test case
* newsletter and doc updates for memory optimizations

* update install version to use latest

* clarify on what task we're talking about
* Update Agent, PyHealth 2.0

* Address Copilot review: fix 4D tensor handling, reward function, and schema

- Add 4D->3D pooling for NestedSequenceProcessor (sum across codes dim)
- Fix binary/multilabel reward to provide feedback for both classes
- Change schema from 'sequence' to 'nested_sequence' throughout
- Update test samples to use nested visit structure

---------

Co-authored-by: Steier <637682@bah.com>
* Update ConCare model for PyHealth 2.0 with testing

* add example MIMIC 4 demo, ConCare

* fixed ipynb, concare

* Address Copilot review feedback: fix imports, formatting, and add tests

---------

Co-authored-by: Steier <637682@bah.com>
* Update Deepr model to PyHealth 2.0

* Address review feedback for Deepr model

- Add integer type check for window parameter
- Improve error messages with shape information
- Add paper reference and usage examples to docstring
- Add nested sequence test case (4D tensor handling)
- Add MIMIC-IV mortality prediction example notebook

---------

Co-authored-by: Steier <637682@bah.com>
* rename to avoid naming collision

* Support multi-worker for task transformation

* Fix import

* Fix deadlock

* better IPC communication

* fix bug when num_workers == 1

* Fix UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown

* Fix crash

* Fix single worker mode.
* Use batch processing

* ensure locality on data access.

* Fix polars threads.
Robert Coffey and others added 28 commits April 19, 2026 18:30
* dl4h final project kobeguo2 - CaliForest

* Update CaliForest to require explicit fit before inference

* Remove unused logit_scale from CaliForest
- Update Robert Coffey's NetID from racoffey2 to rc37
- Rename heading from 'Signal Processing (EEGGCNNDiseaseDetection)' to
  'Signal Processing'; add note clarifying it is implemented in both
  EEGGCNNRawDataset.precompute_features() (batch) and
  EEGGCNNDiseaseDetection.__call__() (streaming)
- Correct PSD band names and ranges to match actual code and FigShare
  arrays: delta/theta/alpha/beta/low_gamma/high_gamma with boundaries
  matching _BAND_RANGES (7.5, 13.0, 30.0, 40.0 Hz)
- Attribute graph adjacency alpha formula to EEGGCNNDataset where it lives
- Fix adjacency description: 'geodesic distance' (not 'arc length on unit sphere')
- Remove non-existent download_lemon.py reference from Quick Start

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
EEG-GCNN: Dataset, Task, Models, and Neurological Disease Detection Pipeline
…processing section

Explain that all training, holdout, and ablation runs use EEGGCNNDataset
(alpha parameter) + EEGGCNNClassification (excluded_bands) — not
EEGGCNNDiseaseDetection. Document the latter as an available streaming
alternative with configurable adjacency_type and bands, but clarify it
was not exercised in the runs described in this document.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
End-to-end example script following PyHealth naming convention
({dataset}_{task_name}_{model}.py). Demonstrates loading FigShare
pre-computed EEG-GCNN arrays, applying EEGGCNNClassification task,
training EEGGraphConvNet with 10-fold CV and WeightedRandomSampler,
and reporting patient-level AUROC.
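
The weighted sampling mentioned in this commit message addresses the class imbalance in the data (1,385 label-0 versus 208 label-1 subjects). The exact weighting scheme used by the example script is not shown in this PR. The sketch below assumes inverse class-frequency weights and uses `random.choices` from the standard library as a stand-in for a weighted sampler. It operates on subject counts purely for illustration, whereas training samples are windows.

```python
import random
from collections import Counter
from typing import List, Sequence


def inverse_frequency_weights(labels: Sequence[int]) -> List[float]:
    """Weight each sample by 1 / (number of samples sharing its label)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


# Subject counts from the PR description.
labels = [0] * 1385 + [1] * 208
weights = inverse_frequency_weights(labels)

# Draw with replacement, proportionally to weight, using a fixed seed.
rng = random.Random(0)
drawn = rng.choices(labels, weights=weights, k=10_000)
print(Counter(drawn))
```

With these weights each class carries the same total probability mass, so the two classes appear roughly equally often in the draw despite the imbalance of about 6.7 to 1 in the underlying data.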

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
examples/eeg_gcnn/ and examples/eeg_gatcnn/ already contain the full
set of end-to-end scripts from the eeg-gcnn-contribution branch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Consolidates tests/test_eeg_gcnn.py (EEGGCNNDiseaseDetection, raw path)
into tests/core/test_eeg_gcnn_dataset.py alongside the existing FigShare
path tests. Fixes stale import (eeg_gcnn_nd_detection →
eeg_gcnn_disease_detection) and task_name assertion. Total: 41 tests,
all passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Brings in post-PR-#3 changes from the EEG-GCNN pipeline branch:
- docs: updated eeg_gcnn_pipeline.rst (band names, adjacency attribution,
  EEGGCNNDiseaseDetection vs EEGGCNNRawDataset clarification)
- examples: renamed scripts to required convention
  (eeg_gcnn_classification_gcn_training.py, _evaluation.py, etc.)
- examples: added sample raw + precomputed data for quick-start testing
- tests: consolidated 41-test suite into tests/core/test_eeg_gcnn_dataset.py
  (merged raw-path tests, fixed stale import, removed tests/test_eeg_gcnn.py)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds TestRawDatasetPrecompute (12 tests) to test_eeg_gcnn_dataset.py,
exercising EEGGCNNRawDataset.precompute_features() end-to-end against
the real EDF/BrainVision sample files in
examples/eeg_gcnn/sample_raw_data/. Verifies all 5 output files are
created with correct shapes, consistent window counts, valid labels,
and finite PSD values. Total suite: 53 tests, all passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…set.py

Separates tests by data path per Robert's feedback:
- test_eeg_gcnn_dataset.py: reverted to original 18 tests covering the
  FigShare path (EEGGCNNDataset + EEGGCNNClassification)
- test_eeg_gcnn_raw_dataset.py: new file with 35 tests covering the raw
  path (EEGGCNNRawDataset + EEGGCNNDiseaseDetection), including a
  TestRawDatasetPrecompute integration class that runs the full
  preprocessing pipeline against committed EDF/BrainVision sample files

Also removes pyhealth/eeg-gcnn-data/ (untracked local data directory).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Restores the missing section comment block — file is now identical to
the original tests/core/test_eeg_gcnn_dataset.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Superseded by tests/core/test_eeg_gcnn_raw_dataset.py which covers all
23 original tests plus 12 new precompute_features integration tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Band table was wrong — corrected to match _BAND_RANGES in eeg_gcnn_raw.py:
  beta:       13.0–30.0 → 13.0–16.0
  low_gamma:  30.0–40.0 → 16.0–30.0
  high_gamma: 40.0–50.0 → 30.0–40.0

Also removed reference to download_lemon.py which does not exist in the repo.
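
The corrected ranges above can be written as a lookup table. This is a sketch rather than the actual `_BAND_RANGES` from eeg_gcnn_raw.py: the beta, low_gamma, and high_gamma ranges are the ones stated in this commit message, the alpha range is inferred from the 7.5 Hz and 13.0 Hz boundaries mentioned in an earlier commit, and the delta and theta ranges are assumptions. Treating each upper edge as exclusive is also an assumption.

```python
from typing import Dict, Optional, Tuple

# (low, high) in Hz; high edge treated as exclusive (assumption).
BAND_RANGES: Dict[str, Tuple[float, float]] = {
    "delta": (1.0, 4.0),         # assumed, not stated in this PR
    "theta": (4.0, 7.5),         # lower edge assumed
    "alpha": (7.5, 13.0),        # inferred from the 7.5 and 13.0 Hz boundaries
    "beta": (13.0, 16.0),        # stated in the commit message
    "low_gamma": (16.0, 30.0),   # stated in the commit message
    "high_gamma": (30.0, 40.0),  # stated in the commit message
}


def band_for_frequency(freq_hz: float) -> Optional[str]:
    """Return the band containing freq_hz, or None if it is outside all bands."""
    for name, (low, high) in BAND_RANGES.items():
        if low <= freq_hz < high:
            return name
    return None


for freq in (14.0, 20.0, 35.0, 45.0):
    print(freq, band_for_frequency(freq))
```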

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Local data directory should not be tracked in version control.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Superseded by tests/core/test_eeg_gcnn_raw_dataset.py which covers all
original tests plus 12 precompute_features integration tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

jburhan commented Apr 22, 2026

Closing to reopen from the correctly named branch jburhan:jburhan-rcoffey-contribution.

@jburhan jburhan closed this Apr 22, 2026